
feat: Support remote filesystem seeds #765

Open
mikeknep wants to merge 4 commits into main from remote-seeds/mknepper


@mikeknep mikeknep commented Jun 23, 2026

Contributor

📋 Summary

Adds support for injecting fsspec filesystems into DirectorySeedReader and FileContentsSeedReader so that they can read seeds from non-local (remote) filesystems.

🔗 Related Issue

Implements this plan

🔄 Changes

  • Introduces FileSystemProvider(Protocol) and a default implementation LocalFileSystemProvider, adding a seam where previously the local filesystem was effectively hardcoded
  • Updates FileSystemSeedSource and its subclasses so they no longer validate directory/file existence when the config object is created; that check is deferred to validation/read time
  • Refactors FileSystemSeedSource and its subclasses
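
A minimal sketch of the seam described in the list above, for readers who have not opened the diff. It is not the code in this PR: only the names FileSystemProvider and LocalFileSystemProvider come from the description, and ensure_root_exists / create_context come from the review summary. The method signatures, the `FsContext` dataclass, `InMemoryProvider`, `open_seed_root`, and the use of pathlib in place of an fsspec filesystem are all assumptions made to keep the example self-contained.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Protocol, Set


class SeedReaderConfigError(Exception):
    """Stand-in for the error raised when a seed root is missing."""


@dataclass(frozen=True)
class FsContext:
    """Hypothetical context object; the real one reportedly carries an fsspec filesystem."""

    root_path: str


class FileSystemProvider(Protocol):
    """Assumed shape of the protocol: check the root, then build a context."""

    def ensure_root_exists(self, path: str) -> None: ...

    def create_context(self, path: str) -> FsContext: ...


class LocalFileSystemProvider:
    """Default provider: checks and resolves paths against the local disk."""

    def ensure_root_exists(self, path: str) -> None:
        if not Path(path).exists():
            raise SeedReaderConfigError(f"Seed root does not exist: {path}")

    def create_context(self, path: str) -> FsContext:
        return FsContext(root_path=str(Path(path).resolve()))


class InMemoryProvider:
    """Hypothetical non-local provider that only knows a fixed set of roots."""

    def __init__(self, roots: Set[str]) -> None:
        self._roots = roots

    def ensure_root_exists(self, path: str) -> None:
        if path not in self._roots:
            raise SeedReaderConfigError(f"Seed root does not exist: {path}")

    def create_context(self, path: str) -> FsContext:
        # No local resolution: a remote path is passed through untouched.
        return FsContext(root_path=path)


def open_seed_root(path: str, provider: Optional[FileSystemProvider] = None) -> FsContext:
    """Use the injected provider, falling back to the local default."""
    provider = provider or LocalFileSystemProvider()
    provider.ensure_root_exists(path)
    return provider.create_context(path)
```

With this shape, a plugin would pass its own provider (for example one backed by a remote fsspec filesystem) and callers that pass nothing keep the local behavior.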

🧪 Testing

  • make test passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable) -- none added, but existing tests pass
  • Added an implementation for the Nemo Platform Data Designer plugin that uses the Files service's fsspec filesystem and confirmed that Directory and FileContents seeds work. See nemo-platform#413 ("feat: Support more Data Designer seed sources")

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

mikeknep added 2 commits June 23, 2026 13:12
Signed-off-by: Mike Knepper <mknepper@nvidia.com>
Signed-off-by: Mike Knepper <mknepper@nvidia.com>
@mikeknep mikeknep requested a review from a team as a code owner June 23, 2026 18:20
@github-actions

Contributor

Fern preview: https://nvidia-preview-pr-765.docs.buildwithfern.com/nemo/datadesigner

Fern previews include the docs-website version archive with PR changes synced into latest. Notebook tutorials are rendered without execution outputs in previews.

@greptile-apps

greptile-apps Bot commented Jun 23, 2026

Contributor

Greptile Summary

This PR introduces a FileSystemProvider protocol with a default LocalFileSystemProvider implementation, decoupling DirectorySeedReader and FileContentsSeedReader from the local filesystem so remote fsspec filesystems can be injected. As a consequence, directory existence validation is deferred from config-construction time to read time, and relative paths are now resolved by the provider at read time rather than at load time.

  • FileSystemProvider / LocalFileSystemProvider added to seed_reader.py; FileSystemSeedReader gains a constructor parameter and delegates ensure_root_exists and create_context to the provider, which replaces the old hardcoded DirFileSystem(LocalFileSystem()) path.
  • FileSystemSeedSource.runtime_path simplified to return self.path as-is; AgentRolloutSeedSource.runtime_path similarly drops load-time resolution, falling back to the format default only when path is None.
  • SeedReaderConfigError introduced and caught in the compiler's _resolve_and_add_seed_columns, converting missing-root errors into InvalidConfigError during the compile phase.
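
The error conversion in the last bullet can be sketched as follows. This is an illustration only, not the compiler's actual code: both exception classes are local stand-ins, and `resolve_seed_columns` is a hypothetical helper that plays the role of _resolve_and_add_seed_columns.

```python
from typing import Callable, List


class SeedReaderConfigError(Exception):
    """Stand-in for the reader-side error (e.g. missing seed root)."""


class InvalidConfigError(Exception):
    """Stand-in for the compile-phase error surfaced to the user."""


def resolve_seed_columns(get_column_names: Callable[[], List[str]]) -> List[str]:
    """Call the reader and translate its config error into a compile error."""
    try:
        return get_column_names()
    except SeedReaderConfigError as exc:
        # "from exc" keeps the original error reachable via __cause__.
        raise InvalidConfigError(f"Invalid seed source: {exc}") from exc
```

Chaining with `from exc` is what lets a test assert that the InvalidConfigError carries the original SeedReaderConfigError as its cause.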

Confidence Score: 5/5

Safe to merge — the filesystem abstraction is well-contained, the behavior changes are intentional and covered by tests, and the error propagation path through the compiler is correct.

The refactor is internally consistent: existence checks moved to the provider's ensure_root_exists, raw paths flow through runtime_path, and the compiler's error conversion is tested end-to-end. The AgentRolloutSeedReader type-narrowing guard correctly prevents misuse by remote providers. No logic errors or data-loss paths found.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py | Introduces FileSystemProvider protocol + LocalFileSystemProvider default, defers existence checks to read time, adds SeedReaderConfigError, and narrows AgentRolloutSeedReader to require a concrete Path |
| packages/data-designer-config/src/data_designer/config/seed_source.py | Removes load-time directory existence validation and path resolution from FileSystemSeedSource; runtime_path now returns the raw path string for resolution by the provider at read time |
| packages/data-designer-engine/src/data_designer/engine/compiler.py | Wraps get_column_names() in a SeedReaderConfigError catch that converts to InvalidConfigError, propagating filesystem validation failures cleanly to the compile phase |
| packages/data-designer-config/tests/config/test_seed_source.py | Updated and expanded tests to cover deferred validation, raw runtime_path preservation, AgentRollout fallback defaults, and plugin subclass inheritance of runtime_path |
| packages/data-designer-engine/tests/engine/resources/test_seed_reader.py | Updates CWD-change test to assert read-time resolution (beta.txt from later_seed_dir rather than alpha.txt from initial_seed_dir), and adds a test for missing-root error propagation |
| packages/data-designer-engine/tests/engine/test_compiler.py | Adds test verifying SeedReaderConfigError raised from get_column_names is re-raised as InvalidConfigError with the original as cause |
| fern/versions/latest/pages/concepts/seed-datasets.mdx | Documentation updated to note that relative local paths are resolved by the active filesystem provider at validation/read time, not at config construction |
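
To make the read-time resolution behavior concrete, here is a small stdlib-only sketch of the scenario the CWD-change test describes. `SeedSource`, `list_seed_files`, and `demo` are hypothetical names, and the directory layout only mirrors the alpha.txt / beta.txt naming from the test description; the real test may be structured differently.

```python
import os
import tempfile
from pathlib import Path
from typing import List


class SeedSource:
    """Hypothetical config object that stores the path exactly as given."""

    def __init__(self, path: str) -> None:
        self.path = path

    @property
    def runtime_path(self) -> str:
        # No resolution at construction time: the raw string is returned.
        return self.path


def list_seed_files(source: SeedSource) -> List[str]:
    """Resolve the relative path against the CWD at read time."""
    root = Path(source.runtime_path).resolve()
    return sorted(p.name for p in root.iterdir())


def demo() -> List[str]:
    original_cwd = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        initial, later = Path(tmp) / "initial", Path(tmp) / "later"
        (initial / "seed").mkdir(parents=True)
        (later / "seed").mkdir(parents=True)
        (initial / "seed" / "alpha.txt").write_text("a")
        (later / "seed" / "beta.txt").write_text("b")
        try:
            os.chdir(initial)
            source = SeedSource("seed")  # config built while in "initial"
            os.chdir(later)              # CWD changes before the read
            return list_seed_files(source)
        finally:
            os.chdir(original_cwd)
```

Under these assumptions `demo()` returns `["beta.txt"]`: the relative path follows the working directory in effect when the read happens, not the one in effect when the config was built.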

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Caller
    participant FileSystemSeedReader
    participant FileSystemProvider
    participant SeedReaderFileSystemContext

    Caller->>FileSystemSeedReader: get_column_names()
    FileSystemSeedReader->>FileSystemSeedReader: _get_filesystem_context()
    FileSystemSeedReader->>FileSystemProvider: ensure_root_exists(runtime_path)
    alt path does not exist
        FileSystemProvider-->>FileSystemSeedReader: raise SeedReaderConfigError
        FileSystemSeedReader-->>Caller: raise SeedReaderConfigError
    end
    FileSystemSeedReader->>FileSystemSeedReader: create_filesystem_context(runtime_path)
    FileSystemSeedReader->>FileSystemProvider: create_context(runtime_path)
    FileSystemProvider-->>FileSystemSeedReader: SeedReaderFileSystemContext(fs, root_path)
    FileSystemSeedReader->>SeedReaderFileSystemContext: build_manifest(context)
    SeedReaderFileSystemContext-->>FileSystemSeedReader: manifest rows
    FileSystemSeedReader-->>Caller: column names
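
The call order in the diagram can also be read as code. The sketch below is an approximation under stated assumptions: `RecordingProvider` is a hypothetical test double, the context is a plain dict rather than a SeedReaderFileSystemContext, the manifest is built by the reader itself, and the `relative_path` column name is invented for illustration.

```python
from typing import Dict, List


class SeedReaderConfigError(Exception):
    """Stand-in for the error raised when the seed root is missing."""


class RecordingProvider:
    """Hypothetical provider that records which methods were called."""

    def __init__(self, files_by_root: Dict[str, List[str]]) -> None:
        self.files_by_root = files_by_root
        self.calls: List[str] = []

    def ensure_root_exists(self, path: str) -> None:
        self.calls.append("ensure_root_exists")
        if path not in self.files_by_root:
            raise SeedReaderConfigError(f"Seed root does not exist: {path}")

    def create_context(self, path: str) -> dict:
        self.calls.append("create_context")
        return {"root_path": path, "files": self.files_by_root[path]}


class FileSystemSeedReader:
    """Simplified reader that follows the sequence shown in the diagram."""

    def __init__(self, runtime_path: str, provider: RecordingProvider) -> None:
        self.runtime_path = runtime_path
        self.provider = provider

    def _get_filesystem_context(self) -> dict:
        # Existence check first; a failure here stops before any context is built.
        self.provider.ensure_root_exists(self.runtime_path)
        return self.provider.create_context(self.runtime_path)

    def build_manifest(self, context: dict) -> List[dict]:
        return [{"relative_path": name} for name in context["files"]]

    def get_column_names(self) -> List[str]:
        rows = self.build_manifest(self._get_filesystem_context())
        return list(rows[0].keys()) if rows else []
```

The point of the ordering is that a missing root fails in ensure_root_exists, so create_context is never reached on the error path.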

Reviews (3): Last reviewed commit: "Fix stale docstring"

mikeknep added 2 commits June 23, 2026 14:27
Signed-off-by: Mike Knepper <mknepper@nvidia.com>
Signed-off-by: Mike Knepper <mknepper@nvidia.com>